Hurricane Prediction Model

Ryan Abramowitz, H. Katrina Alcala, SooHoon Choi, Taegeun Ohe, and Connor Owen


Background

Hurricanes are thermally driven, rapidly rotating storm systems characterized by a low-pressure center and sustained wind speeds exceeding 74 miles per hour. In 2005, Hurricane Katrina, a large Category 5 Atlantic hurricane, caused over 1,800 deaths and $125 billion in damage. The 10 costliest hurricanes in United States history all occurred in the 21st century. According to the real estate analytics firm CoreLogic Inc., more than 32 million homes on the Atlantic and Gulf Coasts, with a combined value of $8.5 trillion, are at risk of hurricane damage.

Problem

The goal of this project is to accurately predict hurricane trajectories (track forecasting), identifying the location and intensity of a hurricane by utilizing diverse data sources, in order to reduce economic damage and save lives. Additionally, the class of each tropical cyclone will be predicted to prevent unnecessary evacuations or resource deployment (e.g. for a tropical cyclone that never develops into a hurricane). A set of predictive models can lower errors and forecast a few days ahead.


Cleaning the Data

The dataset used for this project is the Atlantic Hurricane Database, which includes tropical storm and hurricane observations and is maintained by the National Hurricane Center (NHC). The database includes entries dating as far back as 1851 with numerous features, including the longitude, latitude, and wind speed of the storm at given times. Pandas was used to remove entries for cyclones with missing feature data. Each storm was then classified as a type of tropical cyclone, extratropical cyclone, subtropical storm or subtropical depression, low, or tropical wave.
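The cleaning step can be sketched as follows. The column names and the small inline frame are illustrative stand-ins (the real notebook loads the full HURDAT2 file), but the sentinel-replacement and dropna pattern is the same:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in rows; the real notebook reads the full HURDAT2 file.
df = pd.DataFrame({
    "ID": ["AL011851", "AL011851", "AL021851"],
    "Latitude": [28.0, 28.1, None],
    "Longitude": [-94.8, -95.4, -91.1],
    "Maximum Wind": [80, 80, -99],   # -99 marks missing values in HURDAT2
    "Status": ["HU", "HU", "TS"],
})

# Treat the -99 sentinel as missing, then drop rows lacking feature data.
df = df.replace(-99, np.nan).dropna(subset=["Latitude", "Longitude", "Maximum Wind"])

# Map the raw status codes to readable labels in a new Status_Str column.
status_map = {"HU": "Hurricane", "TS": "Tropical Storm", "TD": "Tropical Depression",
              "EX": "Extratropical Cyclone", "SD": "Subtropical Depression",
              "SS": "Subtropical Storm", "LO": "Low", "WV": "Tropical Wave"}
df["Status_Str"] = df["Status"].map(status_map)
print(len(df))  # 2 rows survive the cleaning
```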

The table above shows a sample of the cleaned dataset, with each storm's status stored in the Status_Str column.

Mapping Storm Trajectory

In order to visualize hurricane trajectories with respect to wind speed, pressure, and the other variables described above, the trajectories of the first 5 hurricanes in the dataset are plotted using Plotly and Pandas.

Feature Analysis

In order to lower the cost of running any trajectory prediction algorithm, it would be wise to determine the importance of each feature in the dataset. Removing unnecessary features will save time and computational resources in future iterations.

Histogram of Features

Below are histograms showing the distribution of each feature's values in the dataset.
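Such per-feature histograms come directly from pandas. The synthetic columns below stand in for the real numeric storm features; with the actual dataset the call would simply be `df.hist()`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins for the numeric storm features (assumed ranges).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Latitude": rng.uniform(10, 45, 500),
    "Longitude": rng.uniform(-100, -20, 500),
    "Maximum Wind": rng.gamma(4, 12, 500),
})

# One histogram panel per numeric column.
axes = df.hist(bins=20, figsize=(9, 3), layout=(1, 3))
plt.tight_layout()
plt.savefig("feature_histograms.png")
```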

Heatmap of Features

In order to visualize the correlation between features, a Pearson correlation heatmap was generated, in which values range from -1 (maximum negative correlation) to 1 (maximum positive correlation). Based on the heatmap shown, we can make a few observations:
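The heatmap boils down to `df.corr()` plus a plotting call. The sketch below uses matplotlib's `imshow` on synthetic stand-in features (the anti-correlation between wind and pressure is injected deliberately); `seaborn.heatmap(corr, annot=True)` would draw the same matrix:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in features with a built-in wind/pressure anti-correlation.
rng = np.random.default_rng(0)
wind = rng.gamma(4, 12, 500)
df = pd.DataFrame({
    "Maximum Wind": wind,
    "Minimum Pressure": 1010 - 0.7 * wind + rng.normal(0, 5, 500),
    "Latitude": rng.uniform(10, 45, 500),
})
corr = df.corr(method="pearson")  # Pearson coefficients in [-1, 1]

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
plt.tight_layout()
plt.savefig("feature_heatmap.png")
```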

PCA Analysis

PCA is used to reduce the number of dimensions in a dataset while retaining the most information by breaking the dataset down into its principal components. The sklearn and seaborn packages were used in this section to perform the analysis.

The chart above shows that PC1 explains more than 60% of the variance in the data. However, we will want a model that will take more variance into account. Next, we will determine how many principal components are necessary to capture at least 90% of the data's variance.
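The search for the number of components capturing 90% of the variance can be sketched with sklearn's cumulative explained-variance ratio. The correlated synthetic matrix below is a placeholder for the cleaned feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data: 10 observed features driven by 4 latent signals.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X = latent @ rng.normal(size=(4, 10)) + 0.1 * rng.normal(size=(300, 10))

# Standardize, fit a full PCA, and accumulate the explained-variance ratios.
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance >= 90%.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components, cumulative[:n_components])
```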

Feature Analysis Conclusion

Based on the feature analysis above, a total of 6 principal components are necessary to retain 90% of the dataset's variance for the dataset examined. This is a 40% reduction from the original number of components.

Classification

One of the goals of this project is to choose a classification algorithm which can be used to classify a future hurricane's status. The algorithms that will be compared in this section are KNN (with and without PCA preprocessing), SVM, Decision Tree, Random Forest, and Gaussian Naive Bayes.

KNN Classification

The K-Nearest Neighbors (KNN) algorithm determines a point's classification by examining the k-closest training samples in the dataset. This section uses SMOTE(), a technique that helps with class balancing when classes have low sample counts by interpolating between existing samples in the dataset to create synthetic ones. By balancing the classes, we can improve the metrics (precision and recall) on the minority classes.
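SMOTE itself comes from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`); its core idea, interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched in a few lines of NumPy:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between each
    chosen sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class (diagonal excluded).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours per sample
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # pick a random minority sample
        j = nn[i, rng.integers(k)]     # pick one of its neighbours
        lam = rng.random()             # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)

X_min = np.random.default_rng(1).normal(size=(20, 3))  # toy minority class
new_samples = smote_like(X_min, n_new=30)
print(new_samples.shape)  # (30, 3)
```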

KNN using SMOTE-Balanced Data

KNN using PCA and SMOTE-Imbalanced Data

Principal Component Analysis (PCA) combined with KNN reduces the dimensionality of the model and results in lower compute times. Since PCA eliminates redundant information, combining it with KNN can also improve performance.

KNN using PCA and SMOTE-Balanced Data

KNN using PCA and Random Oversampling Balanced Data

The table below shows the results for the previous 4 techniques. Comparing the KNN/PCA methods, it is apparent that KNN using SMOTE-Balanced Data produces the highest KNN score. However, including PCA helps reduce runtime; for larger datasets and more complex algorithms, it will be beneficial to use PCA. For the next classification techniques, we will be incorporating PCA.

Method                                              | Optimal K | KNN Score | Time (s)
KNN using SMOTE-Balanced Data                       | 4         | 0.90996   | 6.83
KNN using PCA and SMOTE-Imbalanced Data             | 16        | 0.78840   | 5.42
KNN using PCA and SMOTE-Balanced Data               | 3         | 0.68599   | 5.96
KNN using PCA and Random Oversampling Balanced Data | 6         | 0.72312   | 5.43
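A PCA-plus-KNN pipeline of the kind compared above can be sketched with sklearn. The synthetic data and hyperparameters below are placeholders; `n_components=0.90` asks PCA to keep just enough components to explain 90% of the variance, matching the feature-analysis conclusion:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the cleaned storm features and labels.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.90)),        # keep 90% of the variance
    ("knn", KNeighborsClassifier(n_neighbors=4)),
])
pipe.fit(X_tr, y_tr)
print(round(pipe.score(X_te, y_te), 3))
```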

SVM Classification

Support Vector Machines (SVMs) can be used for data classification. Below we attempt to classify the status of a hurricane using the sklearn library.
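A minimal sketch of the SVM classifier, fit on synthetic stand-in data rather than the real storm features; scaling before `SVC` matters because the RBF kernel is distance-based:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder multi-class data standing in for the storm features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale features, then fit an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
```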

Decision Tree Classification

A decision tree is a non-parametric supervised learning method for classification, which we will use for classifying hurricane status. The tree's goal is to create a model that predicts the status by learning simple decision rules derived from the feature data.
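A sketch of the decision-tree classifier on synthetic stand-in data; `export_text` prints the learned decision rules the paragraph describes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data standing in for the storm features and status labels.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Cap the depth to limit overfitting (hyperparameter chosen for illustration).
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_tr, y_tr)
print(round(tree.score(X_te, y_te), 3))
print(export_text(tree, max_depth=2))  # human-readable decision rules
```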

Decision Tree using Unbalanced Data

Decision Tree using SMOTE Balanced Data

Decision Tree using Randomly Oversampled Balanced Data

Decision Tree Method                            | Accuracy | Time (s)
Decision Tree using Unbalanced Data             | 0.78672  | 0.025
Decision Tree using SMOTE Balanced Data         | 0.64322  | 0.055
Decision Tree Random Oversampling Balanced Data | 0.72201  | 0.031

Random Forest Classification

Random forest is a classification algorithm that works by harnessing the power of many decision trees. It uses bagging and feature randomness when building each individual tree to create an uncorrelated forest of trees. The predictions generated by a random forest should prove more accurate than those of a single decision tree.
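The two sources of randomness just described map directly onto sklearn parameters: bagging is on by default (`bootstrap=True`), and `max_features="sqrt"` gives each split a random feature subset. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the storm features and status labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 bagged trees, each split drawn from a random sqrt-sized feature subset.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_tr, y_tr)
print(round(forest.score(X_te, y_te), 3))
```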

Random Forest using Unbalanced Data

Random Forest using SMOTE Balanced Data

Random Forest using Randomly Oversampled Balanced Data

Random Forest Method                            | Accuracy | Time (s)
Random Forest using Unbalanced Data             | 0.81429  | 0.073
Random Forest using SMOTE Balanced Data         | 0.66573  | 0.074
Random Forest Random Oversampling Balanced Data | 0.74001  | 0.070

Gaussian Naive Bayes Classification

Gaussian Naive Bayes is a supervised learning algorithm that applies Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Below is an implementation of this classification technique using sklearn.
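A sketch of the Gaussian Naive Bayes classifier on synthetic stand-in data; each feature is modeled as an independent Gaussian within each class, which is why the method needs no hyperparameter tuning and trains almost instantly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for the storm features and status labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit per-class Gaussians for each feature and classify via Bayes' theorem.
gnb = GaussianNB().fit(X_tr, y_tr)
print(round(gnb.score(X_te, y_te), 3))
```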

Classification Conclusion

Method                                              | Accuracy | Time (s)
KNN using SMOTE-Balanced Data                       | 0.90996  | 6.83
KNN using PCA and SMOTE-Imbalanced Data             | 0.78840  | 5.42
KNN using PCA and SMOTE-Balanced Data               | 0.68599  | 5.96
KNN using PCA and Random Oversampling Balanced Data | 0.72312  | 5.43
SVM using Unbalanced Data                           | 0.79910  | 1897.75
Decision Tree using Unbalanced Data                 | 0.78672  | 0.025
Decision Tree using SMOTE Balanced Data             | 0.64322  | 0.055
Decision Tree Random Oversampling Balanced Data     | 0.72201  | 0.031
Random Forest using Unbalanced Data                 | 0.81429  | 0.073
Random Forest using SMOTE Balanced Data             | 0.66573  | 0.074
Random Forest Random Oversampling Balanced Data     | 0.74001  | 0.070
Gaussian Naive Bayes                                | 0.60101  | 0.010

Looking at the results of the classification algorithms on unbalanced data, all of them achieved a decent score (close to 80%) except for the Gaussian Naive Bayes method. Training the models on class-balanced data (whether with SMOTE or random oversampling) improves precision and recall on the minority classes, while the overall weighted precision goes down; depending on the use case, the user would need to know the limitations of the model. SVM with grid search takes a significant amount of time and should not be used for our purposes, as its score is only slightly above average and not significantly different. The most reasonable classification method based on these results is Random Forest.

Hurricane Trajectory Prediction with a RNN

In order to predict hurricane trajectory, the group chose to use a recurrent neural network (RNN), implemented below using tools from the tensorflow library. First, the dataset was filtered to only hurricane data (removing other cyclones, tropical storms, etc.). This dataset was then split into a training set and a test set in order to train and evaluate the model.
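Before an RNN can be trained on a track, the sequence of positions has to be cut into fixed-length windows, with each window's next position as the target. A minimal sketch (the window length of 4 steps and the toy track are assumptions):

```python
import numpy as np

def make_windows(track, window=4):
    """Slide a fixed-length window over a storm track.
    track: (T, 2) array of (lat, lon) points.
    Returns X of shape (T - window, window, 2) and y of shape (T - window, 2),
    where each y[i] is the position immediately following window X[i]."""
    X = [track[i:i + window] for i in range(len(track) - window)]
    y = [track[i + window] for i in range(len(track) - window)]
    return np.array(X), np.array(y)

# Toy track moving steadily northwest (illustrative, not real data).
track = np.array([[25.0 + 0.5 * i, -70.0 - 0.7 * i] for i in range(10)])
X, y = make_windows(track)
print(X.shape, y.shape)  # (6, 4, 2) (6, 2)
```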

Creating the Model

Two sequential models were created to predict the hurricane trajectories. One is a simple sequential model while the other is a 2-stacked model; both use layering methods from the Keras API. Once the models were built, they were trained using the previously described datasets one hurricane at a time.
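The 2-stacked variant can be sketched with the Keras Sequential API. The window length of 4 time steps, the 32 recurrent units, and the `SimpleRNN` cell are assumptions for illustration, not the group's actual hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 4  # number of past positions fed to the network (assumed)
model = keras.Sequential([
    layers.Input(shape=(WINDOW, 2)),              # (lat, lon) at each step
    layers.SimpleRNN(32, return_sequences=True),  # first recurrent layer
    layers.SimpleRNN(32),                         # second (stacked) layer
    layers.Dense(2),                              # next (lat, lon) prediction
])
model.compile(optimizer="adam", loss="mse")
model.summary()
# Training would then loop over storms: model.fit(X, y, epochs=...) per track.
```

The first recurrent layer must set `return_sequences=True` so that the second layer receives the full hidden-state sequence rather than only the final state.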

Plotting the Results

Finally, we used Plotly to map the predictions. The results are shown below.